Multi-ancestry meta-analysis of host genetic susceptibility to tuberculosis identifies shared genetic architecture

The heritability of susceptibility to tuberculosis (TB) disease has been well recognized. Over 100 genes have been studied as candidates for TB susceptibility, and several variants were identified by genome-wide association studies (GWAS), but few replicate. We established the International Tuberculosis Host Genetics Consortium to perform a multi-ancestry meta-analysis of GWAS, including 14,153 cases and 19,536 controls of African, Asian, and European ancestry. Our analyses demonstrate a substantial degree of heritability (pooled polygenic h2 = 26.3%, 95% CI 23.7–29.0%) for susceptibility to TB that is shared across ancestries, highlighting an important host genetic influence on disease. We identified one global host genetic correlate for TB at genome-wide significance (p<5 × 10-8) in the human leukocyte antigen (HLA)-II region (rs28383206, p-value=5.2 × 10-9) but failed to replicate variants previously associated with TB susceptibility. These data demonstrate the complex shared genetic architecture of susceptibility to TB and the importance of large-scale GWAS analysis across multiple ancestries experiencing different levels of infection pressure.


Introduction
Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb) and related species, remains a leading cause of death globally.Around one-quarter of the global population is estimated to show immunological evidence of prior exposure to Mtb (Houben and Dodd, 2016), and in 2019 an estimated 10 million people developed the disease, resulting in 1.4 million deaths (WHO, 2020).This disease burden could be substantially reduced with action to address the social determinants of disease and equitable scale-up of existing interventions.However, tools to prevent, diagnose, and treat TB could be improved if a better understanding of the underpinning pathophysiology could help identify those at greatest risk of the disease.
To address these challenges, we established the International Tuberculosis Host Genetics Consortium (ITHGC) to study the host genetics of disease through collaborative and equitable data sharing (Naranbhai, 2016).The ITHGC includes 12 case-control GWAS from nine countries in Europe, Africa, and Asia (total of 14,153 pulmonary TB cases and 19,536 healthy controls).Inclusion of multiple ancestral groups in a multi-ancestry meta-analysis has the advantage of maximizing power and enhancing fine-mapping resolution to identifying true global associated variants that influence TB susceptibility across population groups.
Here we present the first analyses of the ITHGC dataset exploring host genetic correlates of TB susceptibility using a multi-ancestry meta-analysis approach, including fine-mapping of human leukocyte antigen (HLA) loci and estimation of genetic heritability.

Study overview
In total, 12 GWAS from three major ancestral groups (European, African, and Asian) were included in this study (Table 1; a more detailed table outlining the selection of cases and controls is provided in Supplementary file 1a).All individual datasets were imputed and aligned to the same reference allele before association testing, using an additive genetic model, to obtain odds ratios (OR) and p-values to be used in the meta-analysis.For each individual study (for which we had raw genotyping data), the polygenic heritability was estimated, and HLA alleles were imputed for fine-mapping of the HLA regions.
The summary statistics from the individual GWAS of each dataset were used to conduct a combined, multi-ancestry meta-analysis using MR-MEGA and ancestry-specific (European, African, and Asian) fixed effects (FE) meta-analyses using GWAMA.Finally, the impact of infection pressure on the multiancestry meta-regression was assessed and the concordance in direction of effect for the reference allele between studies was investigated.

Polygenic heritability estimates suggest a genetic contribution to TB disease susceptibility
Twin studies estimate the narrow-sense heritability of susceptibility to TB at up to 80% (Diehl and Von, 1936;Kallmann and Reisner, 1943;Comstock, 1978), but there are few modern estimates.Using raw (unimputed) genotyping data, and assuming population prevalence of disease in each study population equivalent to the reported WHO prevalence rates for that country (WHO, 2020), we estimated polygenic heritability of susceptibility to TB in 10 contributing studies which ranged from 5 to 36% (average of 26.3%, Supplementary file 1b).Comparisons of the heritability estimates between studies from different geographical locations do not take into consideration the differences in environmental pressures between the included studies, and as such these estimates of heritability are only interpretable if the distribution of nongenetic determinants of TB is held constant (Pearce, 2011).Furthermore, variations in phenotype definition can have an impact on heritability estimates (Supplementary file 1a).This is supported by previous research by McHenry et al., 2021a, where significant differences in polygenic heritability estimates were identified between subjects with latent TB infection (LTBI), active TB, and subjects classified as resistors.(McHenry et al., 2021a).As this study includes data with varying methods of classifying TB cases and healthy controls (Supplementary file 1a), there is potential for a degree of heterogeneity and misclassification (between cases and controls) that can have an impact on the heritability estimates.Recent history has seen the near elimination of TB in several countries associated with economic development and public health action.However, while improvement of socioeconomic standing and environment has a stronger impact than host genetics, these crude estimates of polygenic heritability do indicate that TB susceptibility is, in part, heritable.These results require future, more rigorous investigations to narrow down the level of heritable risk and pinpoint genomic loci involved by accounting for population stratification to obtain more accurate heritability estimates.

Multi-ancestry meta-analysis identifies susceptibility loci for TB
For the primary multi-ancestry meta-analysis, MR-MEGA was used as it allows for differences in allelic effects of variants on disease risk between GWAS.Principal components (PCs), derived from a matrix of similarities in allele frequencies between GWAS, were plotted and revealed distinct separation between the three main ancestral groups included in the study (Figure 4) .To account for this, the first two PCs were included as covariates in MR-MEGA as they sufficiently accounted for the allele frequency differences between the study populations, as assessed via a QQ-plot and associated lambda inflation value (Figure 1-figure supplement 1, lambda = 1.00).In total, 26,620,804 variants with a minor allele frequency (MAF) > 1% and present in at least three studies were included in the analysis, of which 3,184,478 were present in all 12 datasets.A significant association peak on chromosome 6 was identified in the HLA class II region (Figure 1).One variant (rs28383206, OR = 0.89, CI = 0.84-0.94,p-value=8.26× 10 -9 ) within this peak was associated with susceptibility to TB at genome-wide significance (p<5.0e - , Figures 1-3, Table 2).Both the residual heterogeneity (p-value=0.012) and ancestry-correlated heterogeneity (p-value=5.28e - ) are significant (p-value<0.05) for the associated variant.However, the evidence of ancestry-correlated heterogeneity is much stronger than for residual heterogeneity, indicating that genetic ancestry contributes more to differences in effects sizes between GWAS than does study design (e.g., phenotyping differences and potential case-control misclassification).The association peak encompasses many HLA-ll genes, including HLA-DRB1/5 (major histocompatibility complex, class II, DR beta 1/5), HLA-DQA1 (major histocompatibility complex, class II, DQ alpha 1), and HLA-DQB3 (major histocompatibility complex, class II, DQ beta 3, Figures 1 and 2).While not reaching genome-wide significance, the HLA class l locus is also indirectly tagged through the association with rs2621322, in the TAP2 (transporter 2, ATP binding cassette subfamily B member) gene, a transporter protein that restores surface expression of MHC class I molecules and has previously been implicated in TB susceptibility (Thu et al., 2016).HLA-A, DQA1, DQB1, DRB1, and TAP2 genes have previously been linked to TB susceptibility through TB candidate gene and GWAS analysis (Thu et al., 2016;Kinnear et al., 2017;Stein et al., 2017;Sveinbjornsson et al., 2016;Zhang et al., 2021).The HLA-II locus encodes several proteins crucial in antigen presentation, including HLA-DR, HLA-DQ, and HLA-DP, which are widely implicated in susceptibility to infection and autoimmunity (Kelly and Trowsdale, 2019;Shiina et al., 2009).

HLA-II
Given the strong association peak in the HLA-ll locus (Figures 1 and 2), we imputed HLA-ll alleles to fine-map this association.HLA alleles were imputed using the HIBAG R package that utilizes both genotyping array and population-specific reference panels to obtain the most accurate imputations for each individual dataset.Association testing was then conducted using an additive genetic model for each individual dataset before meta-analyzing the results (Source data 1, sheets 11-15).
The significant HLA associations overlap with the association peak observed in the multi-ancestry meta-analysis (Figure 2) but show more consistency in the direction of effects between the input studies compared to the lead SNPs identified in the association peak.This suggests that the rs28383206 association in the meta-analysis is tagging an HLA allele, where the different linkage disequilibrium (LD) patterns from the included ancestral populations result in the differences in effects sizes between populations at the rs28383206 association.
This variation in significant associations is, in part, attributable to the observed variation in HLA allele frequencies across all the included studies and may also reflect differential tagging of at least one unknown causal variant across populations (Source data 1, sheets 16-22).The variable role of classical HLA alleles in different populations could be partially due to unique infectious pressures that each geographical region faces and could also explain why different strains of Mtb are more or less prevalent in different regions as they adapted to the HLA profile of the population within this region.Sequencing efforts of global mycobacterial isolates find hyperconservation of class II epitopes, suggesting pathogen advantage achieved through limiting HLA-II recognition and highlighting the potential complex interplay between pathogen and host evolution in modifying class II presentation in TB infection (Comas et al., 2010).Previous work has shown evidence of interaction between genetic variants of the host and specific strains of Mtb in Ghanaian, Ugandan, South African, and Asian populations (Möller and Kinnear, 2020;Müller et al., 2021;Correa-Macedo et al., 2019;Salie et al., 2014;Luo et al., 2015;Wampande et al., 2019;Micheni et al., 2021;McHenry et al., 2021b;McHenry et al., 2020).These interactions provide further evidence that Mtb may have undergone substantial genetic evolution, in concert with host migration and evolution of different populations (Comas et al., 2013;Coscolla and Gagneux, 2014).Some studies suggest that HLA-II epitopes may have undergone regional mutations that modify HLA-II binding, and we speculate that the heterogeneity observed in HLA-II associations between regions may, at least in part, be accounted for by different pressures exerted by varying stains of Mtb (Copin et al., 2016).

Impact of infection pressure on meta-regression
To further understand the heterogeneity across populations, we attempted to account for variation in levels of prior exposure that could serve to mask host effects given that not all controls will have been exposed to Mtb.In low transmission settings, more susceptible but unexposed individuals would be included as controls, who, had they been exposed to Mtb, might have progressed to TB disease.Overall, including each cohort's estimated prevalence of prior exposure had a significant impact on the residual heterogeneity and association statistics of 5% of the variants included in the meta-analysis (419,460/8,355,367), which at a significance level of p-value<0.05 is what is to be expected purely by chance.Separating the results into bins according to p-values revealed that the bins where the covariate had the biggest impact were for p-values in the range of 1e -3 to 1e -5 (Figure 1-figure supplement 2), while significant and suggestive associations reported in this study did not show any significant changes in residual heterogeneity.While the proportion of variants significantly impacted when correcting for infection pressures is low and has the biggest impact on variants with larger p-values, there was still an overall reduction in the chi-square value for the residual heterogeneity (mean chi-square value reduced by 10).This suggests that accounting for potential lifetime of infections does account for some of the observed residual heterogeneity; it is most likely not the main driving force for these residuals.
When considering the impact of force of infection, it is important to consider not only the proportion of controls ever exposed but also the impact of recurrent exposure.There is some evidence to suggest that genetic barriers to progression to TB may be overcome if the infectious dose is high (Fox Figure 2. Regional association plot for the chromosome 6 HLA-ll rs28383206 association in the multi-ancestry analysis revealing a significant peak in the HLA-ll region.Image produced using the online LocusZoom database with linkage disequilibrium (LD) mapping set to 'all' and p-values>0.01 removed (Boughton et al., 2021), and source data file has been uploaded to https://doi.org/10.5061/dryad.6wwpzgn2s.
Figure 3. HLA conditioning analysis.(A) Forest plot (odds ratio and 95% confidence interval) of the significant chromosome 6 association (rs28383206) for tuberculosis (TB) susceptibility in the multi-ancestry analysis, implemented using MR-MEGA with genomic control correction (GCC).Of the 12 studies included, 8 contained this variant.Studies that did not contain the variant are included in the plot but do not have results associated with them.(B) Forest plot for HLA DQA1*02:01 for the eight studies included in the HLA association analysis.Other studies included were obtained from literature searches of previous studies where HLA imputation and association studies were performed (Sveinbjornsson et al., 2016;Li et al., 2021;Zheng et al., 2018).For source data, see   , 1929).Repeated exposure may be observed where TB prevalence is high, as in South Africa, and could contribute to the overall lower effects sizes observed in the GWAS enrolling RSA people.Inclusion of potential lifetime infections in meta-regression could help adjust for these effects and prove useful for not only TB, but meta-analysis of infectious diseases in general, and should be further explored.

Other suggestive loci that did not reach significance
There were four loci with suggestive associations and strong peaks on the Manhattan plot (Figure 1) that did not reach significance but should still be considered as potential variants of interest (Supplementary file 1c).One chr9 peak (rs4576509, p-value=7.40e -0 ) was intergenic (Figure 1-figure supplement 3) while the second (rs6477824, p-value=2.99e -0 ) is located in the 5′-UTR region of the zinc finger protein 483 (ZNF483) gene (Figure 1-figure supplement 3), previously associated with age at menarche (Demerath et al., 2013;Elks et al., 2010).The chromosome 11 peak (rs12362545, p-value=1.24e -0 ) is located in the PPFIA binding protein 2 (PPFIBP2) gene (Figure 1-figure supplement 4), which plays a role in axon guidance and neuronal synapse development and has previously been implicated in cancer development (Colas et al., 2011;Wu et al., 2018).The final peak (rs35787595, p-value=5.41e -0 ), on chromosome 16 (Figure 1-figure supplement 5), is located in the craniofacial development protein 1 (CFDP1) gene region and involved in chromatin organization (Messina et al., 2017).These genes have not been previously linked to TB susceptibility and a potential role is unclear, and as a result further validation of these variants is needed before any conclusions on their impact to TB susceptibility can be drawn.

Ancestry-specific meta-analysis
Concordance in the direction of effects of the risk allele between the ancestry-specific meta-analyses was examined to determine whether significant enrichment (above the expected 50%) exists at different p-value thresholds.Significant enrichment in the concordance of direction of effect was only observed when using the European ancestry as reference compared to the African meta-analysis results for SNPs with p-values>0.001 and <0.01 (p-value=0.0061,Supplementary file 1d).The lack of enrichment between the ancestries suggests significant ancestry-specific associations, which could be further compounded by the differences in local infection pressures.Due to the lack of concordance and the separation of the ancestral populations in the principal component analysis (PCA) plot (Figure 4), ancestry-specific meta-analysis was done.
The PCA plot (Figure 4) for the 12 studies (based on mean pairwise genome-wide allele frequency differences calculated by MR-MEGA) illustrates distinct separation between the three major population groups (Asia, Europe, and Africa).The separation observed between the African studies (Gambia/ Ghana and RSA) is due to the high level of admixture in the RSA population.The RSA population is a five-way admixed South African population with genetic contributions from Bantu-speaking African, KhoeSan, European, and South and South East Asian populations, which explains the observed shift in the PCA plot (Daya et al., 2013; Figure 4).
The online version of this article includes the following source data and figure supplement(s) for figure 3: Source data 1.HLA conditioning analysis data.

Figure supplement 1.
Results for the HLA class I and II meta-analysis of all studies overall (unconditioned) (top) and conditioned on the lead SNP for the six studies in which the lead SNP was present at minor allele frequency (MAF) > 2.5% (bottom).
Figure supplement 1-source data 1.Association statistics used to plot the p-value distribution for the fixed-effects meta-analysis for each HLA locus for both the conditioned and unconditioned analyses.QQ-plots for the ancestry-specific analysis show no significant inflation or deflation.After removing associations without any clear peaks on the Manhattan plots (associations driven by a single study), we found no significant associations for the ancestry-specific analysis.However, suggestive peaks that did not reach genome-wide significance were identified in the European and Asian ancestry-specific analyses (Figure 4-figure supplements 1 and 2, Supplementary file 1e).Potential causes for the     lack of associations and suggestive peaks in the African analysis (Figure 4-figure supplement 3) are the increased genetic diversity within Africa, the inclusion of admixed samples (RSA), and the smaller sample size compared to the other ancestry-specific meta-analysis.While power can be increased through inclusion of greater genetic diversity, between-subpopulation differences in allele frequency can introduce confounding.Confounding by genetic background can result in both spurious associations and the masking of true associations.Such confounding may explain why the results observed elsewhere may not replicate in admixed samples.Removing the admixed data and analyzing only the Gambian and Ghanaian datasets also did not produce any significant results, although, clearly, the sample size was smaller.
The suggestive peaks on chromosomes 6 and 11 in the European subgroup analysis overlap with the suggestive peaks of the multi-ancestry meta-analysis (Figure 1, Figure 4-figure supplement 4, Supplementary file 1e), but the suggestive peak on chromosome 8 is unique to this population (Figure 4-figure supplement 1, Supplementary file 1e).The strongest signal for this peak (rs3935174, OR = 0.87, p-value=1.00e - ) is located in the ArfGAP with SH3 domain, ankyrin repeat, and PH domain 1 (ASAP1) region, which encodes an ADP-ribosylation factor (ARF) GTPase-activating protein and is potentially involved in the regulation of membrane trafficking and cytoskeleton remodeling (Brown et al., 1998).Variants in ASAP1 (rs4733781 and rs10956514) have previously been linked to TB susceptibility in a TB-GWAS analysis of the same Russian population included here (Curtis et al., 2015).While these ASAP1 variants were present in all 12 studies and had consistent direction of effects, they presented with a strong signal in the European ancestry-specific analysis only (African and Asian p-values all ≥ 0.1).These differences in association were not driven by allele frequency differences as they are similar between the included study populations.A possible explanation for the association being observed only in the European meta-analysis is that the association is driven by the Russian dataset.rs4733781 has a strong signal in the Russian dataset (p-value=2.96e - ), but very weak signals in all other populations included in the analysis (p-value>0.01) and is in LD with rs3935174 (r2 = 0.6935 and D' = 0.8791) identified in our analysis.rs4733781 also did not replicate in a previous GWAS from Iceland (Sveinbjornsson et al., 2016), further suggesting that this association is not specific to European populations, but rather driven by the large Russian dataset included in this study.
The suggestive peak on chromosome 8 in the Asian subgroup analysis lies in an intergenic region (Figure 4-figure supplement 2, Supplementary file 1e) and the link to TB susceptibility is unclear.Finally, the suggestive region on chromosome 6 overlaps with the significant peak from the multiancestry analysis (Figure 1 and Figure 4-figure supplement 2) and is located in the major histocompatibility complex, class II, DR beta 1 (HLA-DRB1), as discussed above (Figure 4-figure supplement 2, Supplementary file 1e).

Prior associations
To determine whether associations from previously published TB-GWAS, TB candidate SNPs, and SNPs within candidate gene studies replicate in this meta-analysis, we extracted all significant and suggestive associations from prior analyses and compared these to our multi-ancestry and ancestryspecific meta-analysis results (Luo et al., 2019;Schurz et al., 2018;Chimusa et al., 2014;The Wellcome Trust Case Control Consortium, 2007;Curtis et al., 2015;Mahasirimongkol et al., 2012;Qi et al., 2017;Thye et al., 2010;Thye et al., 2012;Quistrebert et al., 2021;Hong et al., 2017;Zheng et al., 2018;Grant et al., 2016;Png et al., 2012;Daya et al., 2014b).In total, 44 SNPs and 36 genes were identified from the GWAS catalog, of which 33 SNPs and all candidate genes were present in our data (Source data 1, sheet 2).We also extracted the association statistics for a further 90 previously identified candidate genes from our multi-ancestry and population-specific meta-analysis results (Source data 1, sheet 2; Naranbhai, 2016).
Using a Bonferroni-corrected p-value of 0.0015 for the number of SNPs tested (33) as the significance threshold for replication, two candidate SNPs (rs4733781: p-value=3.22e - ; rs10956514: p-value=0.000118;Source data 1, sheets 3 and 4) replicated in the multi-ancestry meta-analysis, both located in the ASAP1 gene region (Curtis et al., 2015;Chen et al., 2019;Wang et al., 2018).However, as discussed in the previous section, these associations are driven by the Russian dataset, which is the same data used by Curtis et al., 2015, where these associations were originally discovered (Curtis et al., 2015).As the Russian population included in our analysis presenting with a strong signal for these variants, there is no independent evidence for these candidate SNPs as they did not replicate in any other population.
For the Asian ancestry-specific analysis, the replicated variant was rs41553512, located in the HLA-DRB5 gene (p-value=3.53E-05).HLA-DRB5 is located within the HLA-ll region identified in the multiancestry meta-analysis (Figure 1) and was previously identified by Qi et al., 2017 in a Han Chinese population.The African ancestry-specific analysis did not replicate previous associations, with the lowest p-value at rs6786408 in the FOXP1 gene (p-value=0.023).While this variant was previously identified in a North African cohort, the fact that it does not replicate here could be because of the genetic diversity within Africa and specifically the variability introduced by the five-way admixed South African population.

Discussion
This large-scale, multi-ethnic meta-analysis of genetic susceptibility to TB, involving 14,153 cases and 19,536 controls, identified one risk locus achieving genome-wide significance, and further investigation of this region revealed significant classical HLA allele associations.This association is noteworthy given we show that there is association in other studies for the same allele (Kinnear et al., 2017;Stein et al., 2017).
Based on the significant association, rs28383206, in the HLA region identified in this multi-ancestry analysis (Figure 3A), HLA-specific imputation and association testing were done to fine-map the region and identify potential HLA alleles driving this association.HLA DQA1*02:01 had the strongest signal in the meta-analysis across the eight included studies (Figure 3B), but this signal disappeared when conditioning on the significant SNP (rs28383206).HLA DQA1*02:01 has previously been identified in an Icelandic and two Chinese populations, but the direction of effect was not consistent (Sveinbjornsson et al., 2016;Li et al., 2021;Zheng, 2018).Despite these inconsistencies, the association between Mtb and HLA class II should be explored in more detail in future studies.A study investigating the outcomes of Mtb exposure in individuals of African ancestry identified protective effects of HLA class II alleles for individuals resistant to TB, highlighting the importance of HLA class II and susceptibility to TB (Dawkins et al., 2022).HLA class II is a key determinant of the immune response in TB, and Mtb has the mechanisms to directly interfere with MHC class 2 antigen presentation (Sia and Rengarajan, 2019).This is supported by studies in mice, where mice in which the MHC class ll genes were deleted died quickly when exposed to Mtb and died faster than the mice in which MHC class I genes were deleted (Sia and Rengarajan, 2019).
The p-values of residual heterogeneity in genetic effects between the studies in the multi-ancestry meta-analysis show no significant inflation between the studies.This suggests that the differences in study characteristics (phenotype definition, infection pressure, Mtb strain) are not the main contributor to the lack of significant associations.However, they certainly have an impact, which is further compounded with ancestry-correlated heterogeneity and other factors (e.g., socioeconomic standing).The ancestry-correlated heterogeneity p-values are generally lower than the residual heterogeneity, suggesting that genetic ancestry has a stronger impact on the differences in effects sizes between the studies.This is supported by the fact that previous TB genetic association studies have identified significant effects of ancestry on TB susceptibility (Chimusa et al., 2014;Daya et al., 2014b).However, the effects of genetic ancestry can be confounded by other factors not accounted for in this analysis, such as the differences in socioeconomic factors (including the differences in housing, employment, poverty, and access to healthcare), phenotype definitions, and differences in infection pressure between the included study populations (Hargreaves et al., 2011;Duarte et al., 2018;Lönnroth et al., 2009).Specifically, the lack of consistency and specificity in TB diagnosis between the included studies introduces heterogeneity and the potential for misclassification of cases and controls, which can reduce the power to detect significant associations (Supplementary file 1a).While this is a limitation of this study, the fact that the residual heterogeneity is overpowered by the ancestry-specific heterogeneity suggests that the phenotype definitions are not the main driver behind the lack of significant associations.For the ancestry-specific analysis, fewer studies result in there being less input heterogeneity to account for, but the reduced sample size was not sufficient to detect any ancestryspecific genome-wide associations.This is particularly evident for the African ancestry-specific metaanalysis where the large degree of heterogeneity, which could be a result of the high genetic diversity within Africa, in combination with differences in socioeconomic factors compared to other populations included in this study, resulted in no observable suggestive association peaks (Campbell and Tishkoff, 2008;Peprah et al., 2015).Furthermore, the suggestive associations (Supplementary file 1c and e) reported in this study should be interpreted with care, and further validation is required before any conclusions can be drawn on the impact that they could have on TB susceptibility.
Polygenic heritability estimates revealed genetic contributions to TB susceptibility for all studies, but the level of this contribution varied greatly (5-36%), suggesting that other factors are contributing to both the lack of significant associations detected in this meta-analysis and the variation observed for the polygenic heritability estimates.These factors likely include environmental, socioeconomic, and varying levels of infection pressures, as well as genetic ancestry-specific effects between the included study populations.An individual from South Africa will face a much higher force of infection than individuals in Europe, and making the assumption that environmental circumstances are equal will significantly skew these crude heritability estimates (Pearce, 2011).This argument is sustained by the fact that increasing disease prevalence (higher infection pressure) increased the level of genetic contribution to TB susceptibility up to a certain point, presumably accounted for by increasingly informative control samples, after which further increasing the infection pressure will not further impact genetic susceptibility.
To determine the impact that force of infection has on the level of genetic contribution to TB susceptibility, we modeled values for proportion of people ever infected with Mtb to include in the multi-ancestry meta-analysis and correct for the different force of infection faced by individuals in each country.Inclusion of this covariate, however, only resulted in a significant difference for 5% of the analyzed variants, what is to be expected based on chance alone, and as such we cannot conclude that a significant portion of the observed residual heterogeneity is explained by this.Limited metadata forced us to make several assumptions about the ages of study participants and the dates on which they were enrolled.With more precise metadata, or Mtb infection test results in controls, the potential impact of lifetime infection could be better quantified and may contribute to elucidating genetic TB susceptibility.Multi-ancestry meta-analysis of other infectious diseases could also potentially benefit from the inclusion of force of infection covariates.It would also be important to determine whether there is a level of exposure beyond which host genetic barriers to infection are overcome (Simmons et al., 2018).
A single significant association was identified in this multi-ancestry meta-analysis, which is small when compared to other meta-analyses of similar size.Factors contributing to this include the difficulty in analyzing multi-ancestry data, the outdated arrays and lack of suitable reference panels for the included study populations, and heterogeneity in case and control definitions between the studies.The issue of heterogeneity in definitions is especially pronounced for this study as it included unpublished data with limited information, which does not indicate how cases were confirmed and controls were collected.The complexity of TB and generally small genetic effects suggests that larger sample sizes or alternative methods of investigation are needed.Utilizing GWAS arrays that better capture diverse populations in combination with imputation making use of larger and more diverse reference panels would allow for larger and more consistent datasets for future meta-analysis.Remapping specific areas of interest such as the HLA, ASAP1, or TLR using long-read sequencing would be invaluable.Increased amounts of genetic data will also allow for more accurate TB heritability analysis and permit analysis of polygenic risk scores and exploration of host-pathogen interactions.
In conclusion, this large-scale multi-ancestry TB GWAS meta-analysis revealed significant associations and shared genetic TB susceptibility architecture across multiple populations from different genetic backgrounds.The analysis shows the value of collaboration and data sharing to solve difficult problems and elucidate what determines susceptibility to complex diseases such as TB.We hope that this publication will encourage others to make their data available for future large-scale meta-analyses.

Methods Data
This analysis includes 12 of the 17 published (and unpublished, Table 1, Supplementary file 1) GWAS of TB (with HIV-negative cohorts) prior to 2022 (Schurz et al., 2018;Chimusa et al., 2014;The Wellcome Trust Case Control Consortium, 2007;Curtis et al., 2015;Mahasirimongkol et al., 2012;Qi et al., 2017;Thye et al., 2010;Thye et al., 2012;Daya et al., 2014b).For unpublished works, we contacted researchers that were funded for genetic TB research and acquired data-sharing agreements to obtain summary statistics (or raw data) along with any metadata that was available.It excludes data from Iceland and Vietnam (Quistrebert et al., 2021) as they declined to share data.It excludes data from China, Korea, Peru, and Japan (Luo et al., 2019;Hong et al., 2017;Li et al., 2021;Zheng, 2018;Sveinbjornsson et al., 2016) as data-sharing agreements could not be finalized in time for this analysis.The Indonesian and Moroccan data were too sparsely genotyped and not suitable for reliable imputation.In addition, the Moroccan data was family-based and thus also not suitable for this meta-analysis as this would introduce confounding effects from the inclusion of related individuals (Grant et al., 2016;Png et al., 2012).Finally, cases and controls are also available within large-scale biobanks, for example, UK Biobank, which could also be leveraged in future iterations of this analysis (Munafò et al., 2018).
Included individuals were genotyped on a variety of genotyping arrays (Table 1, Supplementary file 1), and raw genotyping data was available for eight datasets and for the remainder association testing summary statistics were obtained to use in the meta-analysis (Table 1, Supplementary file 1).Quality control (QC) of raw genotyping data (Table 1, Supplementary file 1) was done using Plink (v1.9), followed by pre-phasing using SHAPEIT and imputation with IMPUTE2 with the 1000 genomes phase 3 reference panel (Chang et al., 2015;Delaneau et al., 2013;Howie et al., 2009;Sudmant et al., 2015).QC and imputation were done as described previously (Schurz et al., 2018;Schurz et al., 2019); briefly, we used a MAF filter of 0.025 and an individual and SNP missingness filter of 0.1.Hardy-Weinberg equilibrium threshold was set at a Bonferroni-corrected p-value according to the number of SNPs testes (0.05/number of SNPs) and samples where sex could not be determined from genotyping were also removed.Imputed data was filtered at a quality score of 0.3, prior to individual and genotype filtration steps.Prior to QC and imputation, allele orientation was corrected using Genotype Harmoniser version 1.4.15, and the genome build of all datasets was checked for consistency (GRCh37) and updated if necessary using the liftOver software from the UCSC genome browser (Deelen et al., 2014;Kent et al., 2002).The four datasets with only summary statistics available (Table 1, Supplementary file 1) were imputed and QC'd during the original investigations, but the marker names and allele orientation were checked for concordance between the summary statistics and the rest of the consortium's imputed data.

Polygenic heritability analysis
To assess the level of genetic contribution to TB susceptibility, we estimated polygenic heritability on the individual studies for which raw genotyping data was available (Table 1, Supplementary file 1).Polygenic heritability estimates were calculated using GCTA (v1.93.2), a genomic risk prediction tool (Yang et al., 2011).The genetic relationship matrix was calculated for each autosomal chromosome.Raw genotype data was pruned for SNPs in LD using a 50 SNP window, sliding by 10 SNPs at a time and removing all variants with LD > 0.5.Samples were filtered by removing cryptic relatedness (--grmcutoff 0.025) and assuming that the causal loci have similar distribution of allele frequencies as the genotyped SNPs (--grm-adj 0).Principal components were then calculated (--pca 20) to include as covariates prior to estimating heritability.Heritability estimations were transformed onto the liability scale using the GCTA software to account for the difference in the proportion of cases in the data compared to the population prevalence (Yang et al., 2011).The average heritability estimate was calculated by taking the mean of all estimates and the confidence intervals were estimated based on the standard error across all studies and the number of studies included.

Meta-analysis
All variants with MAF > 1% and polymorphic in at least three studies (from at least two different ancestries) were included in the primary analysis.For the GWAS, summary statistics of each dataset variants with infinite confidence intervals were removed prior to the meta-analysis.A multi-ancestry meta-analysis plus separate ancestry-specific analyses for Africa, Asia, and Europe were performed.MR-MEGA (Meta-Regression of Multi-Ethnic Genetic Association, v0.20), a meta-analysis tool that maximizes power and enhances fine-mapping when combining data across different ethnicities, was used for the multi-ancestry meta-analysis (Mägi et al., 2017).To account for the expected heterogeneity in allelic effects between populations, MR-MEGA implements a multi-ancestry meta-regression that includes covariates to represent genetic ancestry, obtained from multidimensional scaling of mean pairwise genome-wide allele frequency differences.Genomic control correction (GCC) was implemented during the MR-MEGA analysis for the individual input data (if lambda was >1.05) and output statistics, and the first two PCs, calculated from the genome-wide allele frequency differences, were included as covariates in the regression.QQ-plots of p-values and associated lambda values were used to assess the quality of results prior to downstream investigation.
For the ancestry-specific analyses, the studies were grouped by the major ancestral groups (Table 1, Supplementary file 1) and all variants with a MAF of > 1% that were observed in at least two studies were included in the meta-analysis.We performed traditional fixed-effects meta-analyses in GWAMA (v2.2.2), implementing GCC and assessed the results using QQ-plots (Mägi and Morris, 2010).The genome-wide significance threshold for all association testing was set at p-value=5 × 10 -8 (Panagiotou et al., 2012).

HLA imputation
To fine-map HLA alleles over the HLA locus we imputed HLA class l and ll variants for all 8 studies for which raw data was available (Table 1 and Supplementary file 1).HLA imputation for the HLA class l regions A, B and C as well as the HLA class ll regions DPB1, DRB1, DQB1 and DQA1 was done using the R package HIBAG (version 1.5), implemented in the R free software environment (version 4.0.5)using the predict() command for imputation (R Development Core Team, 2013;Zheng, 2018;Zheng et al., 2014).
The reference datasets for HLA imputation are both genotyping panel and population-specific, and HIBAG has a database of reference data for many genotyping arrays.Each reference panel is also available for either Asian, European, or African populations or a mixture of the three (https://hibag.s3.amazonaws.com/hlares_index.html#estimates).For each dataset included for imputation, the reference panel chosen was the same as the genotyping array used for the data and the reference population was chosen to match the data as closely as possible.Asian and European reference panels were used for the Asian and European populations and African references were used for the Gambia and Ghana datasets, while mixed datasets were implemented for the admixed RSA population.
Following imputation, the HIBAG package (hlaAssocTest) command was used to implement an additive association test for the HLA alleles across the different regions limited to alleles at MAF > 2.5%.Analyses were adjusted for the first four PCs with and without the rs28383206 genotype in the model.Association testing results for the eight included studies were then combined in a fixedeffects meta-analysis using Metasoft software (Han and Eskin, 2011).Ancestry-specific meta-analysis grouped according to the major population groups (Table 1, Supplementary file 1) was also done using the same method.

Estimation of infection pressure
To generate a covariate capturing the likely cumulative exposure to Mtb for included controls, the results of Houben and Dodd, 2016 were adapted to produce a distance matrix to feed into the metaanalysis.The approach in this article fits a Gaussian process model of infection risk history to local data.To represent uncertainty in derived results, a sample of 200 estimated histories of the annual risk of TB infection in each country was used to calculate the expected fraction of control participants ever infected with Mtb, assuming that controls were uniformly aged between 35 and 44 y in 2010, which approximates the period during which controls were recruited for most of the studies.The true age of the controls was not known for all of the datasets, but as quite a substantial skew to the age distribution would be required to have an impact on the results, we believe our choice here is justified.This was done by including estimates for the potential lifetime infections for each source population as a covariate in the MR-MEGA multi-ancestry meta regression.To determine the impact of the covariate, a chi-square difference test was implemented, on an SNP-SNP basis, on the residual and association testing statistics of two meta-analysis output statistics, one including and the other excluding the potential lifetime infections covariate (Satorra and Bentler, 2001).The aim was to determine whether inclusion of potential lifetime infections in the regression explained some of the residual heterogeneity.

Concordance of direction of effect
To determine the degree to which direction of effect is shared for SNPs between the ancestry-specific meta-analysis, we followed the methodology of Mahajan et al., 2014.First, we identified all variants present in all 12 included datasets.Among these SNPs, we then identified an independent subset of variants in the European ancestry-specific meta-analysis showing nominal evidence of association (p-value≤0.001) and separated by at least 500 kb.The identified SNPs were then extracted from the Asian and African ancestry-specific meta-analysis results to calculate the number of SNPs that had the same direction of effect as in the European analysis.To determine whether significant excess in concordance of effect direction was present, a one-sided binomial test was implemented with the expected concordance set at 50%.This analysis was then repeated for other p-value thresholds (0.001<p≤0.01;0.01<p≤0.5;and 0.5<p≤1), and also using the African and Asian meta-analysis results as reference.
The following previously published dataset was used:

Figure 1 .
Figure1.Manhattan plot of p-values (more than three studies) from the MR-MEGA analysis of all 12 datasets with genomic control reveals one significant association in the HLA-ll region of chromosome 6 (rs28383206).Image produced using R scripts provided by MR-MEGA(Mägi et al., 2017), and source data file has been uploaded to https://doi.org/10.5061/dryad.6wwpzgn2s.The online version of this article includes the following source data and figure supplement(s) for figure1:

Figure supplement 1 .
Figure supplement 1. QQ-plot (left) from the MR-MEGA analysis of all 12 datasets with genomic control correction, including two principal components (PCs) as covariates.

Figure supplement 2 .
Figure supplement 2. Proportion of variants that had a significant change in association p-value (based on chi-square difference test) following the inclusion of the force of infection p-value for different p-value bins.

Figure supplement 2
Figure supplement 2-source data 1.Proportions of variants for each p-value bin that had a significant change in association p-value, based on chisquare test for significant difference.

Figure supplement 3 .
Figure supplement 3. Forest plots for the suggestive chromosome 9 peaks, rs4576509 (left) and rs6477824 (right) for the trans-ethnic MR-MEGA analysis including all 12 cohorts.

Figure supplement 3
Figure supplement 3-source data 1.Odds ratios and 95% confidence intervals for each source data file used to plot the forest plot of the two suggestive associations rs4576509 and rs6477824.

Figure supplement 4 .
Figure supplement 4. Forest plots for the suggestive chromosome 11 peak, rs12362545, for the trans-ethnic MR-MEGA analysis including all 12 cohorts.

Figure supplement 4
Figure supplement 4-source data 1.Odds ratios and 95% confidence intervals for each source data file used to plot the forest plot of the suggestive association rs12362545.

Figure supplement 5 .
Figure supplement 5. Forest plots for the suggestive chromosome 16 peak, rs35787595, for the trans-ethnic MR-MEGA analysis including all 12 cohorts.

Figure supplement 5
Figure supplement 5-source data 1.Odds ratios and 95% confidence intervals for each source data file used to plot the forest plot of the suggestive association rs35787595.
Figure3.HLA conditioning analysis.(A) Forest plot (odds ratio and 95% confidence interval) of the significant chromosome 6 association (rs28383206) for tuberculosis (TB) susceptibility in the multi-ancestry analysis, implemented using MR-MEGA with genomic control correction (GCC).Of the 12 studies included, 8 contained this variant.Studies that did not contain the variant are included in the plot but do not have results associated with them.(B) Forest plot for HLA DQA1*02:01 for the eight studies included in the HLA association analysis.Other studies included were obtained from literature searches of previous studies where HLA imputation and association studies were performed(Sveinbjornsson et al., 2016;Li et al., 2021;Zheng et al., 2018).For source data, see Figure3-source data 1.

Figure 3
Figure 3 continued on next page Figure3 continued

Figure 4 .
Figure 4. Principal component analysis (PCA) plot of all 12 studies based on the MR-MEGA mean pairwise genome-wide allele frequency differences.Image produced using the R plot function.For source data, see Figure 4-source data 1.The online version of this article includes the following source data and figure supplement(s) for figure 4: Source data 1.PCA source data.

Figure supplement 1 .
Figure supplement 1. Manhattan plot of all p-values (≥2 studies) for the European subgroup analysis.

Figure supplement 2 .
Figure supplement 2. Manhattan plot of all p-values (≥2 studies) for the Asian subgroup analysis.

Figure supplement 3 .
Figure supplement 3. Manhattan plot of all p-values (≥2 studies) for the African subgroup analysis.

Figure supplement 4 .
Figure supplement 4. QQ-plots for the region-specific fixed effects (FE) meta-analysis using genomic control correction (GCC) and implemented in GWAMA.
GWAS, genome-wide association studies; ITHGC, International Tuberculosis Host Genetics Consortium; Mtb, Mycobacterium tuberculosis; TB, tuberculosis.* Estimated proportion of control individuals ever infected with Mtb by age 35-44 in 2010, based on data from Houben & Dodd.† Raw genotyping data available.‡ RSA(A/M): South African admixed population (RSA) Affymetrix (A) and MEGA (M) array data.

Table 2 .
Significant and suggestive associations (p-value ≤1e -5 ) for the multi-ancestry analysis including data from all 12 datasets implementing MR-MEGA analysis with GCC.